Abstract: This paper describes an array structure for a
high-speed division algorithm. The objective is to develop the
division algorithm first with the basic technique and then to
enhance its performance by pipelining the execution process.
For implementation, we consider restoring dividers (i.e., those
that keep the actual residue value at every step). Three
division algorithms are developed, each serving different
applications. The first is the ‘Combinatorial Array Divider’,
which uses an array of processing units, each consisting of a
full adder and a multiplexer. It is a direct implementation of
the hand-division method and gives a basic understanding of the
division process. The second is the ‘Fully Pipelined Array
Divider’, which uses an array of processing units along with a
large number of flip-flops for storing the intermediate
results. Pipelining is one way of improving the overall
processing performance: it reduces the execution time, which is
helpful in certain real-time applications, but it increases the
hardware and therefore the cost and area. The third is the
‘Iterative Restoring Divider’, which uses just a couple of
shift registers and a control unit. This reduces the hardware
(area and cost) but takes a larger number of clock cycles to
execute, so it is preferred in non-real-time applications where
execution time is not critical. A synthesizable model of a
divider that can be implemented in an FPGA is developed, and
the implementation is parameterized (i.e., it can be
implemented for any operand size).

Index Terms: Combinatorial Array Divider, Fully Pipelined
Array Divider, Integer Divider, Iterative Restoring Divider,
Parameterized, Restoring.


I. Introduction 

Division is a complex operation whose VLSI implementation is
generally slower and more area-consuming than the other three
basic arithmetic operations (i.e., addition, subtraction and
multiplication). However, with more complex digital signal
processing (DSP) algorithms being implemented in VLSI, the
divider is increasingly becoming an indispensable VLSI block
for digital design [6]. Furthermore, the number of clock cycles
for integer division varies depending on the operands’ values.
Every general-purpose microprocessor of recent design provides
hardware support for arithmetic division. Also, in digital
signal processors for applications such as three-dimensional
graphics, there is an increasing demand for high-speed dividers
[7]. However, frequently used division algorithms are based on
sequential recurrences that produce one quotient digit per
iteration, which significantly increases the number of
computation steps and sometimes imposes severe limitations on
system performance.

Integer division is a critical operation in CPU design, since
the number of clock cycles needed to complete an integer
division is usually large and unpredictable. The role of
division is becoming more and more critical, owing to the
requirements of signed computer arithmetic, modulus
computation, the calculation of encryption keys, and so on.
Pipelining is one way of improving the overall processing
performance of a processor. This architectural approach allows
the simultaneous execution of several instructions. Pipelining
is transparent to the programmer; it exploits parallelism at
the instruction level by overlapping the execution of
instructions. It is analogous to an assembly line where each
worker performs a specific task and passes the partially
completed product to the next worker [2].

This paper is organized as follows. Section II introduces some
standard integer division algorithms. Section III describes the
basic implementation of the division algorithm. Sections IV, V
and VI describe the implementation of the Combinatorial Array
Divider, the Fully Pipelined Array Divider and the Iterative
Restoring Divider, respectively. The results and conclusions
are presented in Sections VII and VIII.

II. Integer Division 

Division is a basic arithmetic operation that takes two inputs,
the dividend (A) and the divisor (B), and produces two outputs,
the quotient (Q) and the remainder (R), such that
$Q = \mathrm{int}(A/B)$ and $R = A - Q \cdot B$ under the
condition $R < B$. Division can be carried out as a series of
subtractions of the divisor from the dividend, each producing a
partial remainder value.

The standard fixed-point algorithm follows a
“paper-and-pencil” technique: in every iteration, it produces
a fixed number of quotient bits. This involves addition,
multiplication and shift operations. For a proper division,
the dividend is normally greater than the divisor ($A > B$). If
we consider the dividend to be n bits
($A_{n-1} A_{n-2} \ldots A_1 A_0$) and the divisor to be m bits
($B_{m-1} B_{m-2} \ldots B_1 B_0$), where $n > m$, then the
quotient will have n bits ($Q_{n-1} Q_{n-2} \ldots Q_1 Q_0$) and
the remainder will have m bits
($R_{m-1} R_{m-2} \ldots R_1 R_0$).

Many arrays for the division operation have been proposed, and
they can be broadly classified into two categories: (i)
restoring and (ii) non-restoring. In restoring division, the
divisor is subtracted from the dividend (or from the previous
remainder); if the result is negative, the previous dividend is
restored and the quotient bit is taken as zero. Otherwise the
quotient bit is one and the process continues without any
change. In the non-restoring method, the division process is
carried out without restoring the previous dividend,
irrespective of the sign of the result. The organizations of
the two types of dividers are quite similar, and only the
designs of the basic cells are slightly different. It was later
shown that the speed of the two types of arrays is almost
equal, and the restoring technique gives a true remainder. In a
divider array the subtraction can be achieved either directly
or by adding the 2’s complement of the divisor.
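
To make the contrast concrete, the following is a minimal Python sketch of the non-restoring recurrence for unsigned operands. It is an illustration only: the function name `nonrestoring_divide` and its arguments are hypothetical, and the dividers developed in this paper use the restoring technique instead. The final correction step is needed because a non-restoring divider can finish with a negative partial remainder, which is why the restoring technique is said to give a true remainder directly.

```python
def nonrestoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Non-restoring division of an n-bit unsigned dividend a by b > 0."""
    if b <= 0:
        raise ValueError("divisor must be positive")
    r = 0  # signed partial remainder
    q = 0
    for i in range(n - 1, -1, -1):
        bit = (a >> i) & 1
        # Subtract B after a non-negative remainder, add B after a negative
        # one; the previous value is never restored.
        r = 2 * r + bit - b if r >= 0 else 2 * r + bit + b
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:
        r += b  # final correction so that 0 <= r < b
    return q, r


print(nonrestoring_divide(7, 2, 3))  # (3, 1)
```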

III. Implementation 

Given two unsigned numbers A (n bits) and B (m bits), we wish
to design a circuit that produces two outputs Q (n bits) and R
(m bits), where Q is the quotient of A/B and R is the
remainder. This can be implemented by shifting the digits of A
to the left, one digit at a time, into a shift register R.
After each shift operation, R is compared with B. If
$R \ge B$, a 1 is placed in the appropriate bit position of the
quotient and B is subtracted from R. Otherwise, a 0 is placed
in the quotient. This follows the hand-division method: we take
the bits of A one by one, most significant bit first, and
compare the accumulated value with the divisor; if it is
greater than or equal to B, we subtract B from it. The
algorithm is described using pseudo-code. The notation R||A
represents a 2n-bit shift register formed using R as the
left-most n bits and A as the right-most n bits.

    R = 0;
    for i = 0 to n - 1 do
        Left-shift R||A;
        if R >= B then
            q_(n-1-i) = 1;
            R = R - B;
        else
            q_(n-1-i) = 0;
        end if;
    end for;
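
The pseudo-code above can be sketched in Python as follows. This is a software model under stated assumptions, not the synthesizable description: the name `restoring_divide` is hypothetical, the shift of R||A is modelled by picking the bits of `a` from the most significant end, and a zero divisor is rejected here even though the hardware discussed later produces a defined output for it.

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Divide the n-bit unsigned integer a by b using shift, compare, subtract."""
    if b == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    if not 0 <= a < (1 << n):
        raise ValueError("dividend does not fit in n bits")
    r, q = 0, 0
    for i in range(n):
        # Left-shift R||A: the current MSB of A moves into the LSB of R.
        msb = (a >> (n - 1 - i)) & 1
        r = (r << 1) | msb
        if r >= b:
            r -= b
            q |= 1 << (n - 1 - i)  # quotient bits are produced MSB first
    return q, r


print(restoring_divide(13, 3, 4))  # (4, 1)
```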


A. Subtraction of Unsigned Numbers Represented With n Bits:
T = R - B

This point deserves special attention, as the divider hardware
relies on the result obtained here. We usually determine the
sign of the subtraction by sign-extending R and B so that they
are in 2’s complement representation with $n + 1$ bits. Then
we compute $T = R + \mathrm{not}(B) + 1$, where
$T = t_n t_{n-1} t_{n-2} \ldots t_0$ and $t_n$ determines the
sign of the subtraction. However, when R and B are unsigned, we
can compute $\mathrm{not}(B)$ without sign-extending B. We then
analyse $cout_n$:

(i) If $cout_n = 1$, then $R \ge B$ (and $R - B$ is equal to
$T = t_{n-1} t_{n-2} \ldots t_0$, i.e., it is an unsigned
number with n bits).

(ii) If $cout_n = 0$, then $R < B$ (here $R - B$ is NOT equal
to $T = t_{n-1} t_{n-2} \ldots t_0$).


B. Demonstration of the Computation of R - B With n Bits

We have $0 \le R, B \le 2^n - 1$, where R and B are unsigned
binary numbers represented by $R = r_{n-1} r_{n-2} \ldots r_0$
and $B = b_{n-1} b_{n-2} \ldots b_0$, respectively. To compute
$R - B$, we sign-extend R and B to $n + 1$ bits, turning them
into two numbers in 2’s complement representation. The
sign-extension actually amounts to zero-extending. Then
$R = 0\, r_{n-1} r_{n-2} \ldots r_0$ and
$B = 0\, b_{n-1} b_{n-2} \ldots b_0$. In 2’s complement, we
have that $0 \le R, B \le 2^n - 1$. It follows that
$-(2^n - 1) \le R - B \le 2^n - 1$. Thus $R - B$ can be
represented in 2’s complement with $n + 1$ bits (as expected).
Let $K = \mathrm{not}(B) + 1$, represented by
$K = k_n k_{n-1} k_{n-2} \ldots k_0$. In unsigned
representation, $K = 2^{n+1} - B$.

Equ. 1 shows the operation $R - B$ carried out as $R + K$,
where $K = \mathrm{not}(B) + 1$. We let the 1 be held by cin. If
$B = 0$ then $K = 2^{n+1}$ (here K is represented by the second
operand as well as $cin = 1$).

$$
\begin{array}{llcll}
   &                                     &             &    & cin = 1 \\
R: & 0\, r_{n-1} r_{n-2} \ldots r_0 \; - &             & R: & 0\, r_{n-1} r_{n-2} \ldots r_0 \; + \\
B: & 0\, b_{n-1} b_{n-2} \ldots b_0      & \Rightarrow & K: & k_n\, k_{n-1} k_{n-2} \ldots k_0
\end{array}
$$

Equ. 1: Operation $R - B = R + K = R + \mathrm{not}(B) + 1$


Table I: To determine the value of $k_n$

| Case                   | K             | $k_n k_{n-1} k_{n-2} \ldots k_0$ | $k_n$ |
|------------------------|---------------|----------------------------------|-------|
| $B \ne 0$ (or $B > 0$) | $2^n$         | 100...0                          | 1     |
|                        | $2^n + 1$     | 100...1                          | 1     |
|                        | ...           | ...                              | 1     |
|                        | $2^{n+1} - 1$ | 111...1                          | 1     |
| $B = 0$                | $2^{n+1}$     | 1000...0                         | 0     |


Case-1: $R - B < 0$

Since $R \ge 0$, this implies that $B > 0$ and hence $k_n = 1$.

• We have

$$ R + 2^{n+1} - B = \sum_{i=0}^{n-1} r_i 2^i + 2^{n+1} - \sum_{i=0}^{n-1} b_i 2^i < 2^{n+1} $$

$$ R + K = \sum_{i=0}^{n-1} r_i 2^i + k_n 2^n + \sum_{i=0}^{n-1} k_i 2^i < 2^{n+1} $$

• Hence, with $k_n = 1$,

$$ \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i < 2^n $$

• The $n + 1$ bit sum (considering the operation as unsigned)
of R and K is lower than $2^{n+1}$. Then there is no overflow
in the $n + 1$ bit unsigned sum. Thus $c_{n+1} = 0$.

• The n bit sum (considering the operation as unsigned) of R
and $k_{n-1} k_{n-2} \ldots k_0$ is lower than $2^n$. Then there
is no overflow of the n bit unsigned sum. Thus $c_n = 0$.


Case-2: $R - B \ge 0$


• We have

$$ R + 2^{n+1} - B = \sum_{i=0}^{n-1} r_i 2^i + 2^{n+1} - \sum_{i=0}^{n-1} b_i 2^i \ge 2^{n+1} $$

$$ R + K = \sum_{i=0}^{n-1} r_i 2^i + k_n 2^n + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1} $$

• Hence

$$ \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1} - k_n 2^n $$

• The $n + 1$ bit sum (considering the operation as unsigned)
of R and K is greater than or equal to $2^{n+1}$. Then there is
overflow of the $n + 1$ bit unsigned sum. Thus $c_{n+1} = 1$.

• For the n-bit sum of R and $k_{n-1} k_{n-2} \ldots k_0$, we
have two cases.

If $B > 0$, then $k_n = 1$ and hence

$$ \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1} - 2^n = 2^n $$

If $B = 0$, then $k_n = 0$ and hence (recall that in this case K
is held by the second operand together with $cin = 1$)

$$ \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1} $$

• In both cases, the n-bit sum (considering the operands as
unsigned) of R and $k_{n-1} k_{n-2} \ldots k_0$ is greater than
or equal to $2^n$. So there is overflow of the n-bit unsigned
sum. Thus $c_n = 1$ when $R \ge B$.

For the 2’s complement operation $R - B$ with $n + 1$ bits,
there is no overflow of the subtraction, as $c_{n+1} = c_n$.
For $R - B \ge 0$, the result $T = R - B$ is a non-negative
number, thus $t_n = 0$. Therefore
$t_{n-1} t_{n-2} \ldots t_0$ contains $R - B$ in unsigned
representation.

In conclusion: (i) If $R < B$, then $c_n = 0$ and the n bits
$t_{n-1} t_{n-2} \ldots t_0$ do not contain the result
$R - B$. (ii) If $R \ge B$, then $c_n = 1$ and the n bits
$t_{n-1} t_{n-2} \ldots t_0$ do represent $R - B$ in unsigned
representation.
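
The conclusion can be checked numerically. The short Python sketch below (the helper name `subtract_with_carry` is hypothetical) computes the n-bit sum R + not(B) + 1 and returns the carry-out together with the n result bits; the loop then confirms, for every pair of 4-bit operands, that the carry-out is 1 exactly when R >= B and that the result bits equal R - B in that case.

```python
def subtract_with_carry(r: int, b: int, n: int) -> tuple[int, int]:
    """Return (c_n, t) for the n-bit operation R + not(B) + 1."""
    mask = (1 << n) - 1
    total = r + (~b & mask) + 1       # not(B) on n bits, cin = 1
    return total >> n, total & mask   # carry-out c_n and bits t_{n-1}..t_0


n = 4
for r in range(1 << n):
    for b in range(1 << n):
        c_n, t = subtract_with_carry(r, b, n)
        assert c_n == (1 if r >= b else 0)
        if c_n:
            assert t == r - b         # the n bits hold R - B only when R >= B

print(subtract_with_carry(9, 5, 4))   # (1, 4)
print(subtract_with_carry(5, 9, 4))   # (0, 12)
```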

C. Restoring Array Divider for Unsigned Numbers

Let A and B be two positive integers in unsigned
representation: $A = a_{N-1} a_{N-2} \ldots a_0$ with N bits and
$B = b_{M-1} b_{M-2} \ldots b_0$ with M bits, with the condition
that $N > M$. We have $A = (B \times Q) + R$, where Q is the
quotient and R is the remainder. In this parallel
implementation, the result of every stage is called the
residue $R_i$. Fig. 1 depicts the parallel algorithm with N
stages. For each stage $i = 0, 1, \ldots, N - 1$, we have

• $R_i$: denotes the output of stage i, which represents the
residue after that stage.

• $Y_i$: denotes the input of stage i, which holds the
minuend at that stage.

For the next stage, we append the next bit of A to $R_i$. This
becomes $Y_{i+1}$ (the minuend):
$Y_{i+1} = R_i \,||\, a_{N-2-i}$ for $i = 0, 1, \ldots, N - 2$,
with $Y_0 = a_{N-1}$. At each stage i, the subtraction
$Y_i - B$ is performed: (i) if $Y_i \ge B$ then
$R_i = Y_i - B$; (ii) if $Y_i < B$ then $R_i = Y_i$.


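Table II: Restoring algorithm for division

| Stage | $Y_i$                                | Computation of $R_i$                                  | $R_i$ bits |
|-------|--------------------------------------|-------------------------------------------------------|------------|
| 0     | $Y_0 = a_{N-1}$                      | $R_0 = Y_0 - B$, if $Y_0 \ge B$                       | 1          |
|       |                                      | $R_0 = Y_0$, if $Y_0 < B$                             |            |
| 1     | $Y_1 = R_0 \,\|\|\, a_{N-2}$         | $R_1 = Y_1 - B$, if $Y_1 \ge B$                       | 2          |
|       |                                      | $R_1 = Y_1$, if $Y_1 < B$                             |            |
| 2     | $Y_2 = R_1 \,\|\|\, a_{N-3}$         | $R_2 = Y_2 - B$, if $Y_2 \ge B$                       | 3          |
|       |                                      | $R_2 = Y_2$, if $Y_2 < B$                             |            |
| ...   | ...                                  | ...                                                   | ...        |
| M-1   | $Y_{M-1} = R_{M-2} \,\|\|\, a_{N-M}$ | $R_{M-1} = Y_{M-1} - B$, if $Y_{M-1} \ge B$           | M          |
|       |                                      | $R_{M-1} = Y_{M-1}$, if $Y_{M-1} < B$                 |            |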


Fig. 1: Parallel implementation algorithm


Since B has M bits, the operation $Y_i - B$ requires M bits
for both operands. To maintain consistency, we let $Y_i$ be
represented with M bits. $R_i$ represents the output of each
stage. For the first M stages, $R_i$ requires $i + 1$ bits.
However, for consistency and clarity's sake, since $R_i$ might
be the result of a subtraction, we let $R_i$ use M bits.

For stages 0 to $M - 2$, $R_i$ is always transferred onto the
next stage. Note that we transfer only the $M - 1$ least
significant bits of $R_i$. There is no loss of accuracy here,
since $R_i$ requires at most $M - 1$ bits at stage $M - 2$. We
need $R_i$ with $M - 1$ bits since $Y_{i+1}$ uses M bits.

For stages $M - 1$ to $N - 1$: starting from stage $M - 1$,
$R_i$ requires M bits. We also know that the residue requires
at most M bits (its maximum value is $2^M - 2$). So, starting
from stage $M - 1$ we need to transfer M bits. As $Y_{i+1}$ now
requires $M + 1$ bits, we need $M + 1$ units starting from
stage M.

To implement the operation $Y_i - B$ we use a subtractor. If
$Y_i \ge B$ then $cout_i = 1$, and if $Y_i < B$ then
$cout_i = 0$. This $cout_i$ becomes a bit of the quotient:
$Q_i = cout_{N-1-i}$. The quotient Q requires N bits at the
most. Also, the final residue is the result of the last stage.
The maximum theoretical value of the residue is $2^M - 2$, thus
the residue R requires M bits, where $R = R_{N-1}$. Also, note
that we should avoid a division by 0. If $B = 0$, then in our
circuit $Q = 2^N - 1$ and $R = a_{M-1} a_{M-2} \ldots a_0$.
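
The stage-by-stage behaviour described above can be sketched as a small Python model. The function name `array_divide` and the masking of the residue to M bits are modelling assumptions made for this illustration; the model is meant to mirror the relations $Y_i = R_{i-1} || a_{N-1-i}$ and $Q_i = cout_{N-1-i}$, including the stated outputs when B = 0.

```python
def array_divide(a: int, b: int, n_bits: int, m_bits: int) -> tuple[int, int]:
    """N-stage restoring array; the residue is kept in M bits."""
    mask_m = (1 << m_bits) - 1
    r = 0  # residue passed from stage to stage
    q = 0
    for i in range(n_bits):
        a_bit = (a >> (n_bits - 1 - i)) & 1
        y = (r << 1) | a_bit               # minuend Y_i, at most M + 1 bits
        cout = 1 if y >= b else 0          # carry-out of the stage subtractor
        r = (y - b if cout else y) & mask_m
        q |= cout << (n_bits - 1 - i)      # stage i drives quotient bit N-1-i
    return q, r


print(array_divide(200, 7, 8, 4))  # (28, 4)
print(array_divide(200, 0, 8, 4))  # (255, 8): Q = 2^N - 1, R = low M bits of A
```

For a non-zero divisor the residue never exceeds B - 1, so the largest residue with an M-bit divisor is $2^M - 2$, in line with the bound quoted in the text.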


IV. Combinational Array Divider

Fig. 2 shows the hardware of this array divider for N = 8 and
M = 4. Here the first M = 4 stages require only 4 units, while
the following stages require 5 units. This is a fully
combinatorial implementation. Each level computes $R_i$. It
first computes $Y_i - B$. When $Y_i \ge B$, $cout_i = 1$, and
when $Y_i < B$, $cout_i = 0$. This $cout_i$ is used to
determine whether the next $R_i$ is $Y_i - B$ or $Y_i$. Each
Processing Unit (PU) is used to process $Y_i - B$, one bit at
a time, and to let a particular bit of either $Y_i - B$ or
$Y_i$ be transferred on to the next stage.

Fig. 2: Combinational Array Divider block schematic


V. Fully Pipelined Array Divider

Fig. 3 shows the hardware core of the fully pipelined array
divider with its inputs, outputs and parameters. Fig. 4 shows
the internal architecture of this pipelined array divider for
N = 8, M = 4. Note that the first M = 4 stages require only 4
units, while the following stages require 5 units. The enable
input 'E' is distributed across the enable inputs of all
flip-flops. The exception is the shift register on the left,
which is used to generate the valid output.
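
One row of processing units, as used by both array dividers, can be sketched at bit level in Python. The paper states only that a unit consists of a full adder and a multiplexer; the exact wiring below (inverting the divisor bit at the adder input, feeding cin = 1 into the least significant unit, and selecting with the row's carry-out) is an assumption made for illustration, and all function names are hypothetical.

```python
def full_adder(x: int, y: int, c_in: int) -> tuple[int, int]:
    """Return (sum, carry_out) for three input bits."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out


def mux(sel: int, when_one: int, when_zero: int) -> int:
    """2-to-1 multiplexer on single bits."""
    return when_one if sel else when_zero


def array_stage(y_bits: list[int], b_bits: list[int]) -> tuple[list[int], int]:
    """One row of units; bit lists are LSB first and equally long."""
    carry = 1  # cin = 1 supplies the '+1' of the 2's complement
    diff_bits = []
    for y, b in zip(y_bits, b_bits):
        d, carry = full_adder(y, b ^ 1, carry)  # Y + not(B) + cin, rippling up
        diff_bits.append(d)
    cout = carry  # 1 when Y >= B
    # The multiplexer in every unit forwards a bit of Y - B or of Y.
    r_bits = [mux(cout, d, y) for d, y in zip(diff_bits, y_bits)]
    return r_bits, cout


def to_bits(value: int, width: int) -> list[int]:
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits: list[int]) -> int:
    return sum(bit << k for k, bit in enumerate(bits))


r_bits, cout = array_stage(to_bits(13, 5), to_bits(9, 5))
print(cout, from_bits(r_bits))  # 1 4
```

With M = 4, a row of five units (M + 1) handles a 5-bit minuend against the zero-extended 4-bit divisor, which corresponds to the later stages described above.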


Fig. 4: Schematic of Fully Pipelined Array Divider

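
The effect of the stage registers can be illustrated with a cycle-level Python model. This is a sketch under assumptions, not the authors' design: the class name `PipelinedDivider` is hypothetical, the enable input E is taken to be always asserted, and an empty register (`None`) stands in for the valid shift register. It shows that a result appears N clock edges after its operands enter, while a new division can start on every clock.

```python
from typing import Optional

State = tuple[int, int, int, int]  # (a, b, residue, partial quotient)


class PipelinedDivider:
    def __init__(self, n_bits: int) -> None:
        self.n = n_bits
        self.regs: list[Optional[State]] = [None] * n_bits  # one per stage

    def _stage(self, i: int, s: Optional[State]) -> Optional[State]:
        """Combinational logic of stage i applied to the incoming state."""
        if s is None:
            return None
        a, b, r, q = s
        shift = self.n - 1 - i
        y = (r << 1) | ((a >> shift) & 1)
        if y >= b:
            return a, b, y - b, q | (1 << shift)
        return a, b, y, q

    def clock(self, operands: Optional[tuple[int, int]] = None):
        """Advance one clock edge; return (q, r) when a result is valid."""
        first = None if operands is None else (operands[0], operands[1], 0, 0)
        inputs = [first] + self.regs[:-1]
        self.regs = [self._stage(i, s) for i, s in enumerate(inputs)]
        last = self.regs[-1]
        return None if last is None else (last[3], last[2])


div = PipelinedDivider(8)
pairs = [(200, 7), (99, 10), (255, 15)]
results = [div.clock(p) for p in pairs] + [div.clock() for _ in range(8)]
print(results[7:10])  # [(28, 4), (9, 9), (17, 0)]
```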
VI. Iterative Restoring Divider

Figs. 5 and 6 show the state machine and the iterative
hardware architecture, respectively. Here, $R_i$ is always held
in register R. The subtractor computes $Y_i - B$. This requires
$M + 1$ bits in the worst case. If $Y_i \ge B$, then
$R_i = Y_i - B$, where $Y_i$ is the minuend, and $Y_i - B$ is
loaded onto register R. Note that only M bits are needed. If
$Y_i < B$, then $R_i = Y_i$, and only $Y_i$ is loaded onto
register R. This is done by just shifting $a_{N-1}$, the
current most significant bit of register A, into register R.
Here, R requires M bits since it holds the residue at every
stage. Also, since we always shift $cout_i$ onto register A,
the quotient Q is held in A after the last iteration.
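
A register-level Python sketch of this iterative scheme is given below. The class name `IterativeDivider`, the `start`/`clock` methods, the iteration counter and the choice of exactly N clock cycles per division are assumptions made for illustration; the state machine of Fig. 5 is not reproduced here. The model keeps the residue in an M-bit register R and shifts each carry-out into the N-bit register A, so that A holds the quotient when `done` is set.

```python
class IterativeDivider:
    def __init__(self, n_bits: int, m_bits: int) -> None:
        self.n, self.m = n_bits, m_bits
        self.A = self.B = self.R = 0
        self.count = 0
        self.done = True

    def start(self, a: int, b: int) -> None:
        """Load the operands and clear the residue register."""
        self.A, self.B, self.R = a, b, 0
        self.count, self.done = 0, False

    def clock(self) -> None:
        """Perform one iteration per clock edge until done."""
        if self.done:
            return
        msb = (self.A >> (self.n - 1)) & 1
        y = (self.R << 1) | msb                    # minuend, up to M + 1 bits
        cout = 1 if y >= self.B else 0
        self.R = (y - self.B if cout else y) & ((1 << self.m) - 1)
        self.A = ((self.A << 1) | cout) & ((1 << self.n) - 1)  # shift cout in
        self.count += 1
        self.done = self.count == self.n


div = IterativeDivider(8, 4)
div.start(200, 7)
cycles = 0
while not div.done:
    div.clock()
    cycles += 1
print(div.A, div.R, cycles)  # 28 4 8
```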



Fig. 6: Iterative Divider Architecture (signals labeled E, DA,
DB, done, Q, R)

VII. Results

The described divider models were implemented in the FPGA
device XC3S400-TQ144 (Xilinx Spartan-3 family) with speed
grade -5 and in the FPGA device XC4VFX12-SF363 (Xilinx
Virtex-4 family) with speed grade -12. The development system
ISE v14.1 with default settings was used. The implementation
results obtained by Synthesize-XST are given in Table 3 and
Table 4: the maximum combinational path delay (for the
Combinatorial Array Divider) and the minimum period, maximum
frequency, minimum input arrival time before clock and maximum
output required time after clock (for the Fully Pipelined
Array Divider and the Iterative Restoring Divider).

Comparison of Area:

As seen in Figs. 8 and 9, the area required to implement the
designs on these devices has been compared. These comparisons
are made for the three designs based on the implementation in
the FPGA device XC3S400-TQ144 using Implementation Design -
Analyze Timing/Floorplan Design (PlanAhead).


Fig. 5: State Machine of Iterative Restoring Divider 
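Table 3: Comparison table of timing analysis for Spartan 3
(device family: Spartan-3; "-" marks values that do not apply)

| Dividend bits | Divisor bits | Division algorithm | Maximum combinational path delay (ns) | Minimum period (ns) | Maximum frequency (MHz) | Minimum input arrival time before clock (ns) | Maximum output required time after clock (ns) |
|---------------|--------------|--------------------|---------------------------------------|---------------------|-------------------------|----------------------------------------------|-----------------------------------------------|
| 2             | 1            | Combinational      | 7.850                                 | -                   | -                       | -                                            | -                                             |
| 2             | 1            | Pipelined          | -                                     | 2.321               | 430.765                 | 2.160                                        | 7.735                                         |
| 2             | 1            | Iterative          | -                                     | 3.451               | 289.809                 | 3.611                                        | 6.441                                         |
| 4             | 2            | Combinational      | 9.150                                 | -                   | -                       | -                                            | -                                             |
| 4             | 2            | Pipelined          | -                                     | 3.538               | 282.622                 | 2.804                                        | 10.302                                        |
| 4             | 2            | Iterative          | -                                     | 4.222               | 236.860                 | 3.679                                        | 6.456                                         |
| 8             | 4            | Combinational      | 45.999                                | -                   | -                       | -                                            | -                                             |
| 8             | 4            | Pipelined          | -                                     | 4.998               | 200.094                 | 3.238                                        | 12.045                                        |
| 8             | 4            | Iterative          | -                                     | 6.095               | 164.077                 | 4.578                                        | 6.456                                         |
| 16            | 8            | Combinational      | 127.412                               | -                   | -                       | -                                            | -                                             |
| 16            | 8            | Pipelined          | -                                     | 6.399               | 156.266                 | 5.044                                        | 17.465                                        |
| 16            | 8            | Iterative          | -                                     | 6.418               | 155.818                 | 4.247                                        | 6.544                                         |
| 32            | 16           | Combinational      | 669.188                               | -                   | -                       | -                                            | -                                             |
| 32            | 16           | Pipelined          | -                                     | 9.475               | 105.541                 | 9.421                                        | 28.046                                        |
| 32            | 16           | Iterative          | -                                     | 8.330               | 120.053                 | 5.058                                        | 6.895                                         |
| 64            | 32           | Combinational      | 2691.854                              | -                   | -                       | -                                            | -                                             |
| 64            | 32           | Pipelined          | -                                     | 11.175              | 89.484                  | 9.207                                        | 48.893                                        |
| 64            | 32           | Iterative          | -                                     | 10.720              | 93.284                  | 5.339                                        | 7.159                                         |
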

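Table 4: Comparison table of timing analysis for Virtex 4
(device family: Virtex-4; "-" marks values that do not apply)

| Dividend bits | Divisor bits | Division algorithm | Maximum combinational path delay (ns) | Minimum period (ns) | Maximum frequency (MHz) | Minimum input arrival time before clock (ns) | Maximum output required time after clock (ns) |
|---------------|--------------|--------------------|---------------------------------------|---------------------|-------------------------|----------------------------------------------|-----------------------------------------------|
| 2             | 1            | Combinational      | 4.871                                 | -                   | -                       | -                                            | -                                             |
| 2             | 1            | Pipelined          | -                                     | 0.885               | 1130.199                | 1.492                                        | 4.467                                         |
| 2             | 1            | Iterative          | -                                     | 1.619               | 617.608                 | 2.117                                        | 3.856                                         |
| 4             | 2            | Combinational      | 5.586                                 | -                   | -                       | -                                            | -                                             |
| 4             | 2            | Pipelined          | -                                     | 1.495               | 668.762                 | 1.843                                        | 5.532                                         |
| 4             | 2            | Iterative          | -                                     | 1.966               | 508.660                 | 2.142                                        | 3.856                                         |
| 8             | 4            | Combinational      | 22.135                                | -                   | -                       | -                                            | -                                             |
| 8             | 4            | Pipelined          | -                                     | 2.219               | 450.592                 | 2.100                                        | 6.653                                         |
| 8             | 4            | Iterative          | -                                     | 2.627               | 380.713                 | 2.179                                        | 3.856                                         |
| 16            | 8            | Combinational      | 66.053                                | -                   | -                       | -                                            | -                                             |
| 16            | 8            | Pipelined          | -                                     | 2.845               | 351.512                 | 3.161                                        | 8.992                                         |
| 16            | 8            | Iterative          | -                                     | 2.862               | 349.424                 | 2.348                                        | 3.964                                         |
| 32            | 16           | Combinational      | 289.622                               | -                   | -                       | -                                            | -                                             |
| 32            | 16           | Pipelined          | -                                     | 4.058               | 246.418                 | 4.936                                        | 13.416                                        |
| 32            | 16           | Iterative          | -                                     | 3.619               | 276.304                 | 2.863                                        | 4.074                                         |
| 64            | 32           | Combinational      | 1160.928                              | -                   | -                       | -                                            | -                                             |
| 64            | 32           | Pipelined          | -                                     | 14.487              | 69.029                  | 5.379                                        | 22.442                                        |
| 64            | 32           | Iterative          | -                                     | 4.788               | 208.862                 | 3.000                                        | 4.230                                         |
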
Fig. 7: Various parameter graph for Spartan 3


Fig. 8: Various parameter graph for Virtex 4

Fig. 9: Area comparison for Spartan 3





Fig. 9: Area comparison for Virtex 4 


VIII. Conclusion

The paper introduces three synthesizable models of the divider
that can be implemented in any FPGA device. In our approach the
designs have been targeted to Spartan-3 and Virtex-4, and the
results have been compared. The maximum combinational path
delay (ns) of the ‘Combinatorial Array Divider’ grows steeply
as the number of dividend (or divisor) bits increases: in
Tables 3 and 4 it rises by roughly a factor of three to five
for each doubling of the operand width from 8/4 bits upward.
Hence, the overall execution time increases rapidly with the
operand width. The minimum period (ns) of the ‘Fully Pipelined
Array Divider’ is less than that of the ‘Iterative Restoring
Divider’ only while the number of dividend/divisor bits does
not exceed 16/8. When the operand widths are increased beyond
this, the ‘Iterative Restoring Divider’ performs better than
the ‘Fully Pipelined Array Divider’. This inference is
supported by the results for both the maximum frequency (MHz)
and the minimum input arrival time before clock; in addition,
the maximum output required time after clock (ns) remains
almost constant for the ‘Iterative Restoring Divider’, whereas
it increases markedly for the ‘Fully Pipelined Array Divider’.
The ‘Iterative Restoring Divider’ requires the least area, the
‘Combinatorial Array Divider’ requires a moderate amount, and
the ‘Fully Pipelined Array Divider’ requires the most.